Initial setting

# load dependency
dbpathT='Covid-tweets-en.db'
dbpathR='Covid-reddit-en.db'
connT=dbConnect(SQLite(),dbpathT)
connR=dbConnect(SQLite(),dbpathR)

1.Introduction

As the pandemic continues, we want to find out how do people’s thoughts change. What do they care about? Do they have good feelings or bad feelings? What we have done is to build an tool to compare the relationship among keywords trending on twitter, people’s sentiment scores and COVID-19 spreading.

Our tool will serve as an exploratory tool, so it will allow users to explore the data by themselves, giving them a general idea of people’s considering trends, and helping them to solve their problems, like making the business decision or looking for a research topic.

Firstly, we have got the covid related tweet IDs and detailed tweet information. Due to big dataset, we stored them in a database by SQLite. Secondly, we did simple text mining, in this part, we have done the sentiment score calculation, keyword frequency statistics and reverse geocoding. By using the data and text mining, we draw the interactive plotting and mapping. And we have deploied a shiny app to show this.

Moreover, we get the tweets data to compare with the COVID-spreading through Reddit, which is the social news aggregation, web content rating and discussion website. There is no interference of official account, and we can get dataset easily, which is also the Us-based dataset.

2.API Tools

2.1 Twitter API

Since we only have a standard Twitter developer account which has a limitation of downloading data in the last seven days. So, we chose to download the list of tweet IDs related to COVID-19 from Kaggle, which has a huge ID number and is a json format and then used Twitter APIs to look up the details about these tweets. After data collection and cleaning through SQLite and R, we get the first database which has all the tweets data.

2.2 Bing API

When we draw the map with the number of the tweets, we need to get the geo location. So we reverse geocoding with Bing APIs in order to get the longitude and latitude, which is the location that the person creates the tweet. Then establish another database with geographic information.

The reason why use database 1. We can access to large Tweets data set faster. 2. We can get pre-processing information. 3. SQL for fast frequency calculation than R. 4. the use of SQL is more Convenient to add more data and Convenient for further analysis.

3.Sentiment score calculation

People use Twitter to share their interests and concerns. Word frequency analysis shows what topics they are interested in, and sentiment analysis focuses on how they think about it.

So we decide to measure the sentiment by a sentiment score. What we have done is we break the tweets into words firstly, then we use the sentiment lexicon dictionary to tag each of the word with its sentiment, checking whether it is positive or negative. The sentiment score for each tweet is equal to the number of positive words in this tweet minus number of negative words divided by the total number of sentiment tagged words. And the score is between -1 and 1. -1 is completely negative, 1 is completely positive, and 0 is neutral, or mixed. The sentiment score will be sent back to the database for future analysis, so each tweet has a new variable, sentiment score now in our database.

The overall sentiment score is for a bunch of tweets that contain a keyword or several keywords in a period of time. This process will proceed when we actually draw the line plots on the shiny app. So when the user choose the keywords, we use sql to get the the selected tweets and calculate their overall sentiment score, either daily or monthly. Then we use the daily or monthly overall sentiment score to draw the line plot, and see the trend of people’s sentiment for a given keyword, or keywords.

In the same way, we can get another 3 function to get data from two Databases:

Get tweets trend function

Get reddit data function

Get reddit trend function

4.Trendplot function

This function is used to define all our drawing functions.

5. Visualization

5.3 Geo Tweets map

# spread data
covid=read.csv('us_states_covid19_daily.csv')%>%
    select(date,positiveIncrease,state)%>%
    mutate(month=month(ymd(date)))%>%
    {aggregate(positiveIncrease~state+month,.,sum)}
# a group of keywords
keyword='Mask#N95#口罩'
keyword=keyword%>%str_split('#')%>%.[[1]]
# a group of data
trend=keyword%>%
    {getTwitterTrend(connT,geoinfo='state',trend='month',keywords=.,period=NULL)}%>%
    filter(country=='United States')%>%
    mutate(month=as.integer(month))
# example
geoTrendMap(covid,trend)

We conducted a plot of tweets each month, and also get some geographic information of these tweets in our dataset from January to November. We can slide the time bar to get the information of each month. When the mouse is placed in the area of this state, we can see the name of this state and the number of tweets collected in this state. Also, we can see the sentiment score in each state. For the color scale, dark red means the overall sentiment of this state’s tweets is very negative. Sky blue means that the overall sentiment of the state’s tweets is positive. The second information is that the number of infections and deaths in a specific day.